NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration

Yu, Zhongzhi; Wang, Zheng; Fu, Yonggan; Shi, Huihong; Shaikh, Khalid; Lin, Yingyan Celine (July 2024, Proceedings of Machine Learning Research)

Full Text Available
ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer

You, Haoran; Shi, Huihong; Guo, Yipin; Lin, Yingyan Celine (September 2023, NeurIPS 2023 Conference Proceedings)

Vision Transformers (ViTs) have shown impressive performance and have become a unified backbone for multiple vision tasks. However, both the attention mechanism and multi-layer perceptrons (MLPs) in ViTs are not sufficiently efficient due to dense multiplications, leading to costly training and inference. To this end, we propose to reparameterize pre-trained ViTs with a mixture of multiplication primitives, e.g., bitwise shifts and additions, towards a new type of multiplication-reduced model, dubbed ShiftAddViT, which aims to achieve end-to-end inference speedups on GPUs without requiring training from scratch. Specifically, all MatMuls among queries, keys, and values are reparameterized using additive kernels, after mapping queries and keys to binary codes in Hamming space. The remaining MLPs or linear layers are then reparameterized with shift kernels. We utilize TVM to implement and optimize those customized kernels for practical hardware deployment on GPUs. We find that such a reparameterization on (quadratic or linear) attention maintains model accuracy, while inevitably leading to accuracy drops when being applied to MLPs. To marry the best of both worlds, we further propose a new mixture of experts (MoE) framework to reparameterize MLPs by taking multiplication or its primitives as experts, e.g., multiplication and shift, and designing a new latency-aware load-balancing loss. Such a loss helps to train a generic router for assigning a dynamic amount of input tokens to different experts according to their latency. In principle, the faster the experts run, the more input tokens they are assigned. Extensive experiments on various 2D/3D Transformer-based vision tasks consistently validate the effectiveness of our proposed ShiftAddViT, achieving up to 5.18x latency reductions on GPUs and 42.9% energy savings, while maintaining a comparable accuracy as original or efficient ViTs. Codes and models are available at https://github.com/GATECH-EIC/ShiftAddViT.
more » « less
Full Text Available
ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention

https://doi.org/10.1109/HPCA56546.2023.10071081

Dass, Jyotikrishna; Wu, Shang; Shi, Huihong; Li, Chaojian; Ye, Zhifan; Wang, Zhongfeng; Lin, Yingyan (February 2023, 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA))

Full Text Available
Instant-3D: Instant Neural Radiance Field Training Towards On-Device AR/VR 3D Reconstruction

https://doi.org/10.1145/3579371.3589115

Li, Sixu; Li, Chaojian; Zhu, Wenbo; Yu, Boyang; Zhao, Yang; Wan, Cheng; You, Haoran; Shi, Huihong; Lin, Yingyan (June 2023, The The IEEE/ACM International Symposium on Computer Architecture 2023 (ISCA 2023))

Full Text Available
ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design

https://doi.org/10.1109/HPCA56546.2023.10071027

You, Haoran; Sun, Zhanyi; Shi, Huihong; Yu, Zhongzhi; Zhao, Yang; Zhang, Yongan; Li, Chaojian; Li, Baopu; Lin, Yingyan (February 2023, 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA))

Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs’ self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency and more extensive applications to resource constrained platforms. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and Transformers for natural language processing (NLP) tasks: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns, without severely hurting the model accuracy (e.g., <=1.5% under 90% pruning ratio); while NLP Transformers need to handle input sequences of varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns for regularizing two levels of workloads without hurting the accuracy, largely reducing the attention computations while leaving room for alleviating the remaining dominant data movements; on top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator to simultaneously coordinate the aforementioned enforced denser and sparser workloads for boosted hardware utilization, while integrating on-chip encoder and decoder engines to leverage ViTCoD’s algorithm pipeline for much reduced data movements. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3×, 142.9×, 86.0×, 10.1×, and 6.8× over general computing platforms CPUs, EdgeGPUs, GPUs, and prior-art Transformer accelerators SpAtten and Sanger under an attention sparsity of 90%, respectively. Our code implementation is available at https://github.com/GATECH-EIC/ViTCoD.
more » « less
Full Text Available

Search for: All records